This chapter of R for Health Data Science by Dr J H Klopper is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International
Introduction
In this last notebook we will consider the workflow for a simple public health / biomedical science research project.
An R Markdown file or a Quarto document is the ideal way to create a research document, either for your own purposes or within a research team. All the analysis is clearly visible to everyone and the analysis is reproducible. If the results of the analysis are submitted for publication or presentation and there are any requests to view the analysis, an R Markdown file or Quarto document will allow others to verify and understand your work.
As R is open source, a review of the analysis can go all the way down to the machine-instruction level. The same cannot be said of proprietary software such as SAS, where we have to take the word of a corporation that the code is correct. Plus, we have to pay them a lot of money for that promise.
Setting up the analysis
Two important parts of starting an analysis are importing the required libraries and setting the working directory.
Import libraries
Below, we import the libraries that we will use in our research project.
Set working directory
macOS refers to folders, while Windows and Linux refer to directories, on a computer. The setwd function in R sets the working directory. This allows us to easily reference files in the current working directory. The directory can be passed as a character object. It is more common, though, to pass the getwd function as argument. The getwd function returns the folder / directory in which the current file (R Markdown file / Quarto document) is located.
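A minimal sketch of the pattern described above:

```r
# Set the working directory to the value returned by getwd
# (equivalent to passing a character path such as "~/projects/analysis")
setwd(getwd())
```

Passing a hard-coded character path also works, but such a path may not exist on a collaborator's machine, which is why the getwd pattern is preferred.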
It is then simplest to keep data files in the same directory / folder as the R file.
Importing the data
We have examined how to import data using the fread function from the data.table library, read.csv from base R, or read_csv from the tidyverse readr library. Below, we import the csv file that we will use in this notebook, using the read_csv function.
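A self-contained sketch of a csv import is shown below. Since the real file name is not repeated here, a small temporary file stands in for it:

```r
# Write a tiny csv file to a temporary location and read it back
tmp <- tempfile(fileext = ".csv")
writeLines(c("age,chol", "63,233", "67,286"), tmp)

dfr_demo <- read.csv(tmp)  # readr::read_csv(tmp) returns a tibble instead
dim(dfr_demo)              # two rows and two columns
```

In the notebook itself, the argument to read_csv is simply the name of the data file kept in the working directory.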
Inspecting the data
We note from the Environment panel in the top-right of the RStudio interface that the dfr data object contains data on \(301\) observations across \(6\) variables. The same information is returned using the dim function.
The dfr data object contains missing values. With \(301\) observations and \(6\) variables, we should have \(301 \times 6 = 1806\) data points. The is.na function returns Boolean values, with TRUE if a data point is missing and FALSE if it is present. Below, we pass the result to the sum function. We remember that TRUE is stored internally as the number \(1\). Summing over all the Boolean values therefore returns the number of missing data points. Dividing this by the total number of data points gives the proportion of missing data.
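The calculation can be sketched as follows, using a tiny stand-in data frame with two missing values:

```r
# Tiny stand-in for dfr, with two NA values
dfr_demo <- data.frame(age = c(63, 67, NA), chol = c(233, NA, 229))

# Number of missing points divided by the total number of data points
sum(is.na(dfr_demo)) / prod(dim(dfr_demo))  # 2 / 6
```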
We note that less than \(1\%\) of the data is missing. It is not uncommon to have missing data and we must be able to deal with it.
Below, we use the str function to inspect more metadata.
spc_tbl_ [301 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ age : num [1:301] 63 67 67 37 41 56 62 57 63 53 ...
$ sex : num [1:301] 1 1 1 1 0 1 0 0 1 1 ...
$ restbp: num [1:301] 145 160 120 130 130 120 140 120 130 NA ...
$ chol : num [1:301] 233 286 229 250 204 236 268 354 254 203 ...
$ maxhr : num [1:301] 150 108 129 187 172 178 160 163 147 155 ...
$ cad : num [1:301] 0 1 0 0 0 0 1 0 1 0 ...
- attr(*, "spec")=
.. cols(
.. age = col_double(),
.. sex = col_double(),
.. restbp = col_double(),
.. chol = col_double(),
.. maxhr = col_double(),
.. cad = col_double()
.. )
- attr(*, "problems")=<externalptr>
We see NA values. These are the missing data points.
As mentioned, we have \(6\) variables, one in each column of the data set, in accordance with the principles of tidy data. The column \(\texttt{cad}\) contains data on the response or target variable, with levels \(0\) if no coronary artery disease is present and \(1\) if coronary artery disease is present.
The properties of the other \(5\) variables are tabled below.
| Variable | Explanation | Properties |
|---|---|---|
| \(\texttt{age}\) | Age of participant | years |
| \(\texttt{sex}\) | Binary sex | 0 if female and 1 if male |
| \(\texttt{restbp}\) | Resting systolic blood pressure | mm of mercury |
| \(\texttt{chol}\) | Serum total cholesterol | mg/dL |
| \(\texttt{maxhr}\) | Maximum heart rate | Beats per minute |
The original data set from which our illustrative data set was taken had many more variables. In practice, we define research questions and, during the planning phase of a project, create a data collection tool that captures only the variables required to answer the research question(s). The rest of the data in a data set is redundant. (Note: In many cases a project contains many more variables. This makes future analysis for other projects possible, or is simply part of a larger data-gathering project.)
Research questions
As an illustrative example of how to use R in a data analysis project, we will state a primary research question and secondary research questions.
Primary research question
Are \(\texttt{age, sex, restbp, chol,}\) and \(\texttt{maxhr}\) associated with \(\texttt{cad}\)?
Secondary research questions
- Is there a difference in the numerical variables between those with and without coronary artery disease?
- Is binary sex associated with coronary artery disease?
- Is there correlation between age and cholesterol values?
Any data analysis project begins with summary statistics and data visualization.
Summary statistics
The psych library has a useful describe function that computes summary statistics. We start with the \(\texttt{age}\) variable.
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 301 54.57 9.04 56 54.76 8.9 29 77 48 -0.22 -0.55 0.52
We see the number of observations, n, the mean, the standard deviation, sd, the median, the trimmed mean, trimmed, the median absolute deviation, mad, the minimum, min, the maximum, max, the range, the skewness, skew, the kurtosis, and the standard error, se.
The describeBy function has a group argument that takes a factor (categorical) variable as value. The summary statistics of the numerical variable are then computed for each of the levels of the factor variable. Below, we perform a summary analysis of the \(\texttt{age}\) variable by the two levels of the \(\texttt{cad}\) variable. Since the latter was encoded with a \(0\) and a \(1\), we can use the factor function to set labels for the two levels, using the labels argument.
# Summary statistics of age by levels of coronary artery disease
psych::describeBy(
dfr$age,
group = dfr$cad
)
Descriptive statistics by group
group: No
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 218 53.45 9.23 54 53.49 10.38 29 76 47 -0.08 -0.64 0.62
------------------------------------------------------------
group: Yes
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 83 57.51 7.86 58 57.97 5.93 38 77 39 -0.46 -0.02 0.86
We note that there seems to be a difference between the mean ages of those without and those with coronary artery disease. We are already starting to understand the information in the data, before any test for inference has been performed.
Next, we look at the comparative summary statistics of the resting blood pressure, with the comparison once again along the levels of the \(\texttt{cad}\) variable.
Descriptive statistics by group
group: No
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 215 130.04 16.82 130 129.16 14.83 94 192 98 0.58 0.59 1.15
------------------------------------------------------------
group: Yes
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 83 135.52 19.4 132 134.01 17.79 100 200 100 0.77 0.58 2.13
Below, we repeat the process for the \(\texttt{chol}\) and \(\texttt{maxhr}\) variables too.
Descriptive statistics by group
group: No
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 218 244.23 51.09 240 241.38 43.74 126 564 438 1.41 6.49 3.46
------------------------------------------------------------
group: Yes
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 82 253.59 55.03 254 252.15 54.11 131 409 278 0.31 0 6.08
Descriptive statistics by group
group: No
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 215 154.84 21.13 158 156.32 20.76 88 202 114 -0.64 0.12 1.44
------------------------------------------------------------
group: Yes
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 83 134.64 21.46 136 135.25 20.76 71 182 111 -0.29 -0.18 2.36
A contingency table summarizes the observed values for two categorical variables. We can create such a summary table of observed values for the binary sex, \(\texttt{sex}\), and the coronary artery disease, \(\texttt{cad}\), variables. We start by setting the levels and their labels for the \(\texttt{sex}\) variable.
The xtabs function takes a formula as first argument. The formula is indicated by the ~ symbol. We list the two variables, concatenated by the + symbol. Since \(\texttt{sex}\) is the explanatory variable and \(\texttt{cad}\) is the response variable, we use them in the formula in that order. The second argument is data, and it takes as value the data object that contains the variables.
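The call can be sketched as follows. A tiny stand-in data frame is defined first so that the sketch runs on its own; in the notebook, dfr is the imported data set and the first lines are not needed:

```r
# Tiny stand-in for the imported data set (dfr already exists in the notebook)
dfr <- data.frame(
  sex = factor(c("Female", "Male", "Male", "Female", "Male", "Male")),
  cad = factor(c("No", "Yes", "No", "No", "Yes", "No"))
)

# Table of observed values: explanatory variable first, then the response
xtabs(~ sex + cad, data = dfr)
```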
The resultant table of observed data shows the joint frequencies.
Now that we have a good idea of the information in the data, we enrich our understanding by visualizing the data.
Data visualization
We have seen the comparative summary statistics for the continuous variables. Below, we create box-and-whisker plots for each of these. Box-and-whisker plots are a good choice for visualizing numerical data.
We also see the use of the ggplotly function from the plotly library. We assign the plot to a variable and pass this as argument to the ggplotly function. The result is an interactive plot.
The \(\texttt{age}\) and \(\texttt{cad}\) variables are considered in Figure 1.
age_cad <- (dfr %>% ggplot2::ggplot(
aes(
x = cad,
y = age
)
) +
ggplot2::geom_boxplot(
aes(
fill = cad
),
show.legend = FALSE
) +
ggplot2::labs(
title = "Distribution of age",
subtitle = "Comparison between coronary artery disease groups"
) +
ggplot2::scale_fill_brewer(
palette = "Set1",
direction = -1
) +
ggplot2::xlab("Coronary artery disease") +
ggplot2::ylab("Age") +
ggthemes::theme_clean());
plotly::ggplotly(age_cad)

The \(\texttt{restbp}\) and \(\texttt{cad}\) variables are considered in Figure 2.
restbp_cad <- (dfr %>% ggplot2::ggplot(
aes(
x = cad,
y = restbp
)
) +
ggplot2::geom_boxplot(
aes(
fill = cad
),
show.legend = FALSE
) +
ggplot2::labs(
title = "Distribution of resting blood pressure",
subtitle = "Comparison between coronary artery disease groups"
) +
ggplot2::scale_fill_brewer(
palette = "Set1",
direction = -1
) +
ggplot2::xlab("Coronary artery disease") +
ggplot2::ylab("Resting blood pressure") +
ggthemes::theme_clean());
plotly::ggplotly(restbp_cad)

The \(\texttt{chol}\) and \(\texttt{cad}\) variables are considered in Figure 3.
chol_cad <- (dfr %>% ggplot2::ggplot(
aes(
x = cad,
y = chol
)
) +
ggplot2::geom_boxplot(
aes(
fill = cad
),
show.legend = FALSE
) +
ggplot2::labs(
title = "Distribution of cholesterol levels",
subtitle = "Comparison between coronary artery disease groups"
) +
ggplot2::scale_fill_brewer(
palette = "Set1",
direction = -1
) +
ggplot2::xlab("Coronary artery disease") +
ggplot2::ylab("Cholesterol") +
ggthemes::theme_clean());
plotly::ggplotly(chol_cad)

The \(\texttt{maxhr}\) and \(\texttt{cad}\) variables are considered in Figure 4.
maxhr_cad <- (dfr %>% ggplot2::ggplot(
aes(
x = cad,
y = maxhr
)
) +
ggplot2::geom_boxplot(
aes(
fill = cad
),
show.legend = FALSE
) +
ggplot2::labs(
title = "Distribution of maximum heart rate",
subtitle = "Comparison between coronary artery disease groups"
) +
ggplot2::scale_fill_brewer(
palette = "Set1",
direction = -1
) +
ggplot2::xlab("Coronary artery disease") +
ggplot2::ylab("Maximum heart rate") +
ggthemes::theme_clean());
plotly::ggplotly(maxhr_cad)

A scatter plot visualizes the correlation between two continuous variables. Below, we use the ggstatsplot library to create a visual representation of the correlation between the \(\texttt{age}\) and \(\texttt{chol}\) variables.
The \(\texttt{age}\) and \(\texttt{chol}\) variables are considered in Figure 5.
A mosaic plot is a visual representation of a contingency table or table of observed values. In Figure 6, we visualize the joint proportions of the \(\texttt{sex}\) and \(\texttt{cad}\) variables.
Inferential statistics
We are finally ready to answer the research questions. We will start with the secondary research questions.
Is there a difference in the numerical variables between those with and without coronary artery disease?
We recalculate the sample mean and standard deviation of the values in the age column below.
# A tibble: 2 × 3
cad Average StandardDeviation
<fct> <dbl> <dbl>
1 No 53.5 9.23
2 Yes 57.5 7.86
Comparing a continuous variable between two groups can be done using a t test or a Mann-Whitney U test. We use the latter, non-parametric, test if the assumptions for the use of parametric tests are not met.
There are many assumptions for the use of parametric tests. Here, we will use the Shapiro-Wilk test to determine if the continuous variable is from a population in which the values are normally distributed. Under the null hypothesis for this test, the variable can be described by a normal distribution. We will set a level of significance of \(\alpha=0.05\) throughout.
Below, we use the filter and select verbs from the dplyr library and the pull function, which returns a vector of values. We use a pipeline to pipe the vector to the shapiro.test function. We start with the ages of those without coronary artery disease.
dfr %>%
filter(cad == "No") %>% # We have to use the label value that we set before
select(age) %>%
pull() %>%
shapiro.test()
Shapiro-Wilk normality test
data: .
W = 0.9885, p-value = 0.0777
We fail to reject the null hypothesis. Now, we do the same test for those with coronary artery disease.
dfr %>%
filter(cad == "Yes") %>% # We have to use the label value that we set before
select(age) %>%
pull() %>%
shapiro.test()
Shapiro-Wilk normality test
data: .
W = 0.97065, p-value = 0.05436
In both cases, we fail to reject the null hypothesis, and we proceed on the assumption that the variable is normally distributed in the population.
The other test that we will perform is Bartlett's test. The null hypothesis is that the variance of the continuous variable is equal in the two groups. The bartlett.test function performs this test. We pass the two variables as arguments. Levene's test is an alternative that we can use if the data are not normally distributed.
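The call passes the continuous variable first and the grouping variable second. A synthetic stand-in for the data set is generated here so that the sketch runs on its own; in the notebook, dfr already exists:

```r
# Synthetic stand-in for the imported data set (dfr already exists in the notebook)
set.seed(42)
dfr <- data.frame(
  age = c(rnorm(30, mean = 53, sd = 9), rnorm(30, mean = 57, sd = 8)),
  cad = factor(rep(c("No", "Yes"), each = 30))
)

# Bartlett's test: continuous variable first, grouping factor second
bartlett.test(dfr$age, dfr$cad)
```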
Bartlett test of homogeneity of variances
data: dfr$age and dfr$cad
Bartlett's K-squared = 2.8812, df = 1, p-value = 0.08962
We fail to reject the null hypothesis and can use an equal-variance t test, which is Student's t test. This test is performed using the t.test function. Below, we use the formula argument, set to age ~ cad. This formula states that we want to compare the ages between the two levels of the \(\texttt{cad}\) variable. We also set the var.equal argument to TRUE (as a consequence of the result of Bartlett's test).
t.test(
formula = age ~ cad,
data = dfr,
alternative = "two.sided", # The default two-sided alternative hypothesis
var.equal = TRUE # The default is FALSE and performs Welch's test
)
Two Sample t-test
data: age by cad
t = -3.5407, df = 299, p-value = 0.0004627
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
-6.303966 -1.799825
sample estimates:
mean in group No mean in group Yes
53.45413 57.50602
The result shows a t statistic of \(-3.5407\). Below, we use the qt function to determine the (negative) critical t value for \(301-2=299\) degrees of freedom and for \(\alpha=0.05\).
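The calculation uses the lower tail at \(\alpha / 2 = 0.025\) for the two-sided test:

```r
# Negative critical t value for a two-sided test at alpha = 0.05,
# with 301 - 2 = 299 degrees of freedom
qt(
  0.025,
  df = 299
)
```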
The t statistic is smaller than the critical t value. We also see a p value of \(0.0004627\). We have that \(p < \alpha\) and we reject the null hypothesis.
We conclude that there is enough evidence in the data at the \(5\%\) level of significance to state that there is a difference between the ages of those with and without coronary artery disease.
Is binary sex associated with coronary artery disease?
Both \(\texttt{sex}\) and \(\texttt{cad}\) are binary variables. We can use Pearson's \(\chi^{2}\) test (also known as the \(\chi^{2}\) test of independence) to determine if there is an association between the two variables. The chisq.test function performs the analysis. We pass the table of observed values as argument. We set the correct argument to FALSE so as not to perform Yates' continuity correction.
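The call can be sketched as follows. A stand-in data frame is rebuilt from the observed counts (\(81\), \(16\), \(137\), \(67\)) so that the sketch runs on its own and reproduces the same test statistic; in the notebook, dfr already exists:

```r
# Stand-in rebuilt from the observed counts (dfr already exists in the notebook)
dfr <- data.frame(
  sex = factor(c(rep("Female", 81 + 16), rep("Male", 137 + 67))),
  cad = factor(c(rep("No", 81), rep("Yes", 16), rep("No", 137), rep("Yes", 67)))
)

# Pearson's chi-squared test of independence, without Yates' correction
chisq.test(
  xtabs(~ sex + cad, data = dfr),
  correct = FALSE
)
```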
Pearson's Chi-squared test
data: xtabs(~sex + cad, data = dfr)
X-squared = 8.7979, df = 1, p-value = 0.003016
The critical \(X^{2}\) value is calculated below, using the qchisq function for a single degree of freedom and \(\alpha=0.05\).
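The calculation uses the upper tail, so we ask for the \(0.95\) quantile:

```r
# Critical chi-squared value for alpha = 0.05 and one degree of freedom
qchisq(
  0.95,
  df = 1
)
```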
The \(X^{2}\) statistic is much larger than the critical value. We also see a p value that is less than our chosen level of significance.
We will look at a single assumption for the use of our results. We require all the expected values (joint frequencies) to be at least \(5\). We do this by running the same test as above, but extracting the expected attribute using $ notation.
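The extraction can be sketched as follows, again rebuilding a stand-in from the observed counts so that the sketch runs on its own:

```r
# Stand-in rebuilt from the observed counts (dfr already exists in the notebook)
dfr <- data.frame(
  sex = factor(c(rep("Female", 81 + 16), rep("Male", 137 + 67))),
  cad = factor(c(rep("No", 81), rep("Yes", 16), rep("No", 137), rep("Yes", 67)))
)

# Expected joint frequencies under the null hypothesis of independence
chisq.test(
  xtabs(~ sex + cad, data = dfr),
  correct = FALSE
)$expected
```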
cad
sex No Yes
Female 70.25249 26.74751
Male 147.74751 56.25249
All the expected values under the null hypothesis of independence are at least \(5\) and the assumption is met.
We can conclude that there is enough evidence in the data at the \(5\%\) level of significance to show that the outcome, coronary artery disease, is dependent on binary sex.
The odds ratio can also be used as a measure of the association between the two binary variables. If Male is our exposure level and Yes is our disease level, then we can pass the xtabs result as argument to the oddsratio.wald function from the epitools library.
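The Wald point estimate that oddsratio.wald reports can also be checked by hand from the observed counts, which is a useful sanity check on the library output:

```r
# Odds of disease for each exposure level, from the observed counts
odds_male   <- 67 / 137  # Yes / No among males
odds_female <- 16 / 81   # Yes / No among females

# Wald odds ratio point estimate
odds_male / odds_female
```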
$data
cad
sex No Yes Total
Female 81 16 97
Male 137 67 204
Total 218 83 301
$measure
odds ratio with 95% C.I.
sex estimate lower upper
Female 1.000000 NA NA
Male 2.475821 1.344367 4.559535
$p.value
two-sided
sex midp.exact fisher.exact chi.square
Female NA NA NA
Male 0.002558549 0.003572185 0.00301579
$correction
[1] FALSE
attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"
We note that the odds of coronary artery disease are \(2.475821\) times higher in men compared to women. The \(95\%\) confidence interval is \(1.344367\) to \(4.559535\). The odds ratio of \(1\) under the null hypothesis is not in this interval. From this, we can also conclude that there is an association between binary sex and coronary artery disease.
Is there correlation between age and cholesterol values?
We can do a correlation test or, better still, create a linear regression model, with \(\texttt{age}\) as explanatory variable and \(\texttt{chol}\) as response variable.
Below, we remind ourselves of the mean value for each variable.
# Mean age and cholesterol values
dfr %>%
summarize(
Average_age = mean(age),
Average_cholestrol = mean(chol)
  )

# A tibble: 1 × 2
Average_age Average_cholestrol
<dbl> <dbl>
1 54.6 NA
We note that the mean cholesterol value is reported as NA. This is because there are missing data for this variable. While many functions in R deal with missing data by default, this is not so for the summarize function in the dplyr library. Instead, we have to specify what to do with the missing data. Below, we simply remove the missing values by setting the na.rm argument to TRUE.
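The call can be written as below; note the na.rm = TRUE argument inside each call to mean. A tiny stand-in with a missing cholesterol value is defined first so that the sketch runs on its own:

```r
# Tiny stand-in for the imported data set (dfr already exists in the notebook)
library(dplyr)

dfr <- data.frame(age = c(63, 67, 37), chol = c(233, NA, 229))

# Remove missing values before computing each mean
dfr %>%
  summarize(
    Average_age = mean(age, na.rm = TRUE),
    Average_cholestrol = mean(chol, na.rm = TRUE)
  )
```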
# A tibble: 1 × 2
Average_age Average_cholestrol
<dbl> <dbl>
1 54.6 247.
We now see the two means.
A linear model is created using the lm function. The formula states the response variable and the explanatory variable. The model is assigned to the variable agechol, which is passed to the summary function.
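The model creation looks as follows (a sketch; a synthetic stand-in for the data is generated first so that it runs on its own, whereas the notebook uses the imported dfr):

```r
# Synthetic stand-in for the imported data set (dfr already exists in the notebook)
set.seed(42)
dfr <- data.frame(age = round(runif(40, 29, 77)))
dfr$chol <- 187 + 1.1 * dfr$age + rnorm(40, sd = 50)

# Fit the simple linear regression model and summarize it
agechol <- lm(chol ~ age, data = dfr)
summary(agechol)
```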
Call:
lm(formula = chol ~ age, data = dfr)
Residuals:
Min 1Q Median 3Q Max
-123.468 -33.239 -6.417 27.980 303.512
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 186.6538 18.1527 10.282 < 2e-16 ***
age 1.1020 0.3282 3.358 0.000888 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 51.4 on 298 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.03646, Adjusted R-squared: 0.03322
F-statistic: 11.27 on 1 and 298 DF, p-value: 0.0008878
The estimate is \(\hat{\beta}_{1}=1.102\). For every \(1\) year increase in age, we see an average \(1.102\) unit increase in cholesterol. The coefficient of determination is only \(0.03322\), i.e. the regression model explains only \(3.3\%\) of the variation in cholesterol. We do see a p value that is less than \(0.05\), and we conclude that there is enough evidence at the \(5\%\) level of significance to show that there is a linear association between age and cholesterol value.
Primary research question
To answer the primary research question, we require a logistic regression model. We start with an omnibus test, using a likelihood ratio test to determine if at least one of the explanatory variables is associated with the response variable.
To do this, we require the full model and a nested model with only the intercept. The glm function creates a logistic regression model if the family argument is set to binomial, with the link function set to "logit", the logistic link function.
First, we create a new data.frame object by removing all observations with missing data.
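One way to do this uses base R's na.omit function (tidyr::drop_na is an equivalent tidyverse option). The object name dfr_no_na matches the model output further down. A tiny stand-in is defined first so that the sketch runs on its own:

```r
# Tiny stand-in with missing values (dfr already exists in the notebook)
dfr <- data.frame(age = c(63, NA, 37), chol = c(233, 286, NA))

# Keep only the observations with no missing values in any variable
dfr_no_na <- na.omit(dfr)
nrow(dfr_no_na)  # one complete row remains in this stand-in
```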
Below, we create the model.
Now we use the lrtest function from the lmtest library to conduct the likelihood ratio test.
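The two models and the test can be sketched together as follows. The object names cad_null and cad_full are illustrative, and a synthetic stand-in for the complete-case data is generated so that the sketch runs on its own; the formulas match Model 1 and Model 2 in the output below:

```r
# Synthetic stand-in for the complete-case data (dfr_no_na in the notebook)
set.seed(42)
n <- 100
dfr_no_na <- data.frame(
  age    = rnorm(n, 55, 9),
  sex    = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  restbp = rnorm(n, 131, 17),
  chol   = rnorm(n, 247, 52),
  maxhr  = rnorm(n, 149, 22),
  cad    = factor(sample(c("No", "Yes"), n, replace = TRUE))
)

# Nested model with only the intercept
cad_null <- glm(cad ~ 1, family = binomial(link = "logit"), data = dfr_no_na)

# Full model with all five explanatory variables
cad_full <- glm(
  cad ~ age + sex + restbp + chol + maxhr,
  family = binomial(link = "logit"),
  data = dfr_no_na
)

# Likelihood ratio test of the nested model against the full model
lmtest::lrtest(cad_null, cad_full)
```

If the lmtest library is not available, stats::anova(cad_null, cad_full, test = "Chisq") performs the equivalent test in base R.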
Likelihood ratio test
Model 1: cad ~ 1
Model 2: cad ~ age + sex + restbp + chol + maxhr
#Df LogLik Df Chisq Pr(>Chisq)
1 1 -174.03
2 6 -141.31 5 65.435 9.104e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see a very large test statistic and a p value that is less than \(0.05\). We can conclude that there is enough evidence at the \(5\%\) level of significance to show that at least one of the explanatory variables is associated with the \(\texttt{cad}\) variable.
We use the summary function to see the results of the logistic regression model.
Call:
glm(formula = cad ~ age + sex + restbp + chol + maxhr, family = binomial(link = "logit"),
data = dfr_no_na)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.584705 1.890700 -0.309 0.75713
age 0.015643 0.018932 0.826 0.40865
sexMale 1.136090 0.359257 3.162 0.00157 **
restbp 0.017121 0.008509 2.012 0.04420 *
chol 0.005954 0.002839 2.097 0.03597 *
maxhr -0.039994 0.007225 -5.536 3.1e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 348.05 on 293 degrees of freedom
Residual deviance: 282.62 on 288 degrees of freedom
AIC: 294.62
Number of Fisher Scoring iterations: 5
From the Wald test statistics, we see that there is an association between each of the explanatory variables and the response variable (while adjusting for the other variables in the model), except for the \(\texttt{age}\) variable. We can view the odds ratios by exponentiating the coefficients. The latter can be extracted using the coefficients attribute and $ notation. The exp function performs the exponentiation.
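A sketch of the pattern, assuming the fitted full model object is named cad_full (an illustrative name); a small model fitted to the built-in mtcars data stands in for the real model so that the sketch runs on its own:

```r
# Illustrative fitted logistic model (stands in for the full model above)
cad_full <- glm(am ~ wt + hp, family = binomial(link = "logit"), data = mtcars)

# Exponentiate the coefficients to obtain odds ratios
exp(cad_full$coefficients)
```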
(Intercept) age sexMale restbp chol maxhr
0.5572702 1.0157661 3.1145668 1.0172684 1.0059721 0.9607955
We can also return the confidence intervals (\(95\%\) by default) using the confint function and passing the model as argument. The exp function is once again used to exponentiate the results.
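A sketch, again assuming the fitted model object is named cad_full (an illustrative name), with a small built-in-data model standing in for the real one:

```r
# Illustrative fitted logistic model (stands in for the full model above)
cad_full <- glm(am ~ wt + hp, family = binomial(link = "logit"), data = mtcars)

# Exponentiated 95% confidence intervals for the odds ratios
exp(confint(cad_full))
```

The profiling message in the output below is printed by confint while it computes profile-likelihood intervals.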
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.01321579 22.5054740
age 0.97886145 1.0545864
sexMale 1.57604693 6.4861902
restbp 1.00055012 1.0346653
chol 1.00033218 1.0116236
maxhr 0.94677470 0.9740642
We note that the odds ratio of \(1\) (under the null hypothesis) is in the confidence interval of the \(\texttt{age}\) variable.
Conclusion
We have used a simple research example and questions to demonstrate the workflow of a typical research project.
R is capable of working with data, analyzing data, and creating plots, and it allows us to draw conclusions from a variety of statistical tests.
An R Markdown file or a Quarto document is an ideal tool for creating a computational essay that we can archive or share with others.